Other data wrangling functions

Published

August 2, 2023

1 Other important functions for data wrangling.

In this section we will explore a few other functions that you most likely are going to need.

  • _join()
  • pivot_wider() and pivot_longer()









1.1 Getting started for now

Restart R and load the tidyverse and the here package

Show the code
library(tidyverse)
library(here)









1.2 Wide and Long data formats

In the realm of data science, it’s common to encounter data in what we call a “wide” format. To visualize this, imagine that each measurement of the same attribute is spread across multiple columns instead of being listed down in rows. Consider a scenario where we record a person’s weight at different intervals; in the wide format, we might see individual columns dedicated to each time point, like weightkg_week0, weightkg_week10 and so forth. Essentially, the column labels often carry dual information: the attribute (in this case, ‘weight’) and the condition or time (like ‘week0’ or ‘week10’).

Often, data is inputted in this wide format for convenience, as it visually follows the chronological order of data collection or event occurrence. However, in the practice of data science, it’s generally more efficient to work with data in what we call a “long” format.

In long format, the data is structured such that each row is a unique observation or instance, and each column represents a unique variable or attribute. Continuing with the weight example, instead of having separate columns for each week’s weight (weightkg_week0, weightkg_week10), we would have separate rows for each weight measurement, with a column for the weight (weightkg) and another column for the time point (week).

The long format, also known as “tidy” data, provides a systematic and consistent structure, allowing us to work with data more effectively. Plus, many functions and packages in R, such as the popular ggplot2 for data visualization, are designed to work optimally with this long format.

But how do we move from wide format to long format? And vice versa? This is where the pivot_longer() and pivot_wider() functions from the tidyverse in R come into play.

Run the code below and explore the dataset that it creates

The code creates a made-up data set. You dont need to understand the code.

  • Is the data frame in a wide or long format? How can you tell?
set.seed(1)

weight_data <- tibble(
  id = 1:100,
  sex = sample(c("Male", "Female"), size = 100, replace = TRUE),
  weightkg_week0 = rnorm(100, mean = 95, sd = 5)) %>% 
  
  # Mean wiegh more than women
  mutate(
    weightkg_week0 = if_else(
      sex == "Male", weightkg_week0 * rnorm(100, mean = 1.15, sd = 0.3),
      weightkg_week0
    )
  ) %>% 
  
  # Weight change over weeks
  mutate(
    weightkg_week1 = rnorm(100, mean = weightkg_week0*0.96, sd = 0.5), 
    weightkg_week2 = rnorm(100, mean = weightkg_week1*0.97, sd = 0.5),
    weightkg_week3 = rnorm(100, mean = weightkg_week2*0.98, sd = 0.5), 
    weightkg_week4 = rnorm(100, mean = weightkg_week3*0.99, sd = 0.5),
    weightkg_week5 = rnorm(100, mean = weightkg_week4, sd = 0.5))
  

weight_data

What is the mean weight of men and women at the different time-points?

This exercise is only here to show you how troublesome it can be to work with data in a wide format. You don’t need to write all the code.

Show the code
weight_data %>% 
  group_by(sex) %>% 
  summarise(week0 = mean(weightkg_week0),
            week1 = mean(weightkg_week1),
            # etc. etc.
            )

1.3 pivot_longer

The pivot_longer() function takes wide format data and makes it longer by increasing the number of rows and decreasing the number of columns. For example, it can take our weight data from multiple columns (weightkg_week0, weightkg_week10) and collapse them into two columns: one for weight and one for timepoint.

Read the help file for pivot_longer()

Show the code
?pivot_longer
  • What does this line in the help file mean?

    cols <tidy-select> Columns to pivot into longer format.

  • What does the names_to and values_to arguments do?

Show the code
# This means that you use all the functions that you also use with select() when you tell R what columns you to pivot into a longer format.


Create a long version of weight_data called weight_data_long

Show the code
weight_data_long <- weight_data %>% 
  pivot_longer(cols = starts_with("weight"))

weight_data_long

Try again, but this time use the names_to and values_to arguments to make names for the new columns

Show the code
weight_data_long <- weight_data %>% 
  pivot_longer(cols = starts_with("weight"),
               names_to = "week",
               values_to = "weightkg")
weight_data_long

We will fix the values in the week column later. For now we are happy with the long data format.

What is the mean and standard deviation of weight of men and women at the different time-points?

  • assign you results to an object called weight_data_long_sum
Show the code
weight_data_long_sum <- weight_data_long %>% 
  group_by(sex, week) %>% 
  summarise(mean = mean(weightkg))

weight_data_long_sum

1.4 pivot_wider()

The pivot_wider() function takes long format data and makes it wider by decreasing the number of rows and increasing the number of columns. For instance, it can spread our long format weight data back into multiple columns for each week’s weight.

Read the help file for pivot_wider()

Show the code
?pivot_wider

Make weight_data_long_sum wider using the week variable

Show the code
weight_data_long_sum %>% 
  pivot_wider(names_from = "week",
              values_from = mean)
# A tibble: 2 × 7
# Groups:   sex [2]
  sex    weightkg_week0 weightkg_week1 weightkg_week2 weightkg_week3
  <chr>           <dbl>          <dbl>          <dbl>          <dbl>
1 Female           94.7           90.9           88.3           86.4
2 Male            113.           108.           105.           103. 
# ℹ 2 more variables: weightkg_week4 <dbl>, weightkg_week5 <dbl>

Make weight_data_long_sum wider using the sex variable

Show the code
weight_data_long_sum %>% 
  pivot_wider(names_from = "sex",
              values_from = mean)
# A tibble: 6 × 3
  week           Female  Male
  <chr>           <dbl> <dbl>
1 weightkg_week0   94.7  113.
2 weightkg_week1   90.9  108.
3 weightkg_week2   88.3  105.
4 weightkg_week3   86.4  103.
5 weightkg_week4   85.5  102.
6 weightkg_week5   85.4  102.

1.5 More examples of pivot_wider() and pivot_longer()

Tidy the simple tibble below.

Change it into three variables:

  • sex (“female”, “male”)
  • pregnant (“yes”, “no”)
  • count, which is a non-negative integer representing the number of observations.
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes", NA, 10,
  "no", 20, 12
)
preg
Show the code
preg %>%
  pivot_longer(c(male, female), names_to = "sex", values_to = "count")


Use table2 for this exercise

table2 is a dataset that is load with the tidyverse. Try to type table2 in the console.

table2 displays the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population).

Change table2 into the output below:

Show the code
table2 %>% 
  pivot_wider(names_from = type,
              values_from = count)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

1.6 _join()

Run this code to get example data

set.seed(1)
A <- tibble(id = c(1:5, 7:8),
            sex = sample(c("Male", "Female"), 7, replace = TRUE),
            nr_of_orders = sample(5:10, 7, replace = TRUE))
B <- tibble(id = 1:6,
            age = sample(25:75, 6),
            total_order_value = sample(5:50, 6)*100)


What rows doesnt match between A and B

Show the code
anti_join(A, B)
anti_join(B, A)


What will be the results of the below code

semi_join(A, B)


Keep all rows in A and all columns from A and B

Show the code
left_join(A, B)


Keep all rows and all columns from both A and B, and sort by id

Show the code
full_join(A,B) %>% 
  arrange(id)


Calculate the average order value

Show the code
full_join(A,B) %>% 
  mutate(
    avg_order_value = total_order_value/nr_of_orders) 


Calculate the mean of the average order value for males and females

Show the code
full_join(A,B) %>% 
  mutate(
    avg_order_value = total_order_value/nr_of_orders) %>% 
  group_by(sex) %>% 
  summarise(
    mean_avg_order_value = mean(avg_order_value, na.rm = TRUE)
  )